Understanding the structure of data

R
Variable Assignment
Analysis
Author

James Peters

Published

November 24, 2023

Welcome! Enjoy reading my very first post.

R provides several basic data types and classes, each suited to particular needs in statistical computing, data analysis, and programming. The list below summarizes the most common ones. The first six (numeric, integer, character, logical, complex, and raw) are atomic vector types, while factors, dates, and date-times are classes built on top of them:

  1. Numeric:

    • Signifying real numbers in floating-point notation.

    • Example: x <- 3.14

  2. Integer:

    • Denoting whole numbers.

    • Example: y <- 42L

  3. Character:

    • Representing textual or string data.

    • Example: name <- "John"

  4. Logical:

    • Expressing Boolean values (TRUE or FALSE).

    • Example: is_valid <- TRUE

  5. Complex:

    • Conveying complex numbers.

    • Example: z <- 2 + 3i

  6. Raw:

    • Characterizing a vector of bytes.

    • Example: raw_data <- charToRaw("ABC")

  7. Factor:

    • Capturing categorical data with distinct levels.

    • Employing the factor() function.

    • Example: gender <- factor(c("male", "female", "male"))

  8. Date:

    • Encoding date values.

    • Example: today <- as.Date("2023-01-01")

  9. POSIXct and POSIXlt:

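    • Representing date-time values. POSIXct stores the number of seconds since 1970-01-01 UTC, while POSIXlt stores a list of components (year, month, day, hour, and so on).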

    • Example: current_time <- Sys.time()
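To check how R classifies each of the example values above, class() reports the high-level class and typeof() reports the underlying storage type. The following is a minimal sketch using only base R; the variable names mirror the examples in the list.

```r
# Example values for each type listed above
x <- 3.14                                   # numeric (double)
y <- 42L                                    # integer
name <- "John"                              # character
is_valid <- TRUE                            # logical
z <- 2 + 3i                                 # complex
raw_data <- charToRaw("ABC")                # raw bytes
gender <- factor(c("male", "female", "male"))
today <- as.Date("2023-01-01")
current_time <- Sys.time()

# class() gives the high-level class, typeof() the internal storage type
class(x);        typeof(x)         # "numeric", "double"
class(y);        typeof(y)         # "integer", "integer"
class(gender);   typeof(gender)    # "factor",  "integer"
class(today);    typeof(today)     # "Date",    "double"
class(current_time)                # "POSIXct" "POSIXt"

# Factor levels are sorted alphabetically by default
levels(gender)                     # "female" "male"
```

Note that a factor is stored as integer codes and a Date as a double, which is why class() and typeof() can disagree.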

We will explore these structures using the Bikeshare data frame from the ISLR2 package. The Bikeshare data frame contains 8,645 observations of 15 variables, each describing one aspect of the bike-sharing data. Here is an overview of the variables:

  1. season:

    • Numeric variable representing the season.

    • Example: 1, 2, 3, 4.

  2. mnth:

    • Factor variable with 12 levels representing the month.

    • Example: “Jan”, “Feb”, “March”, ..., “Dec”.

  3. day:

    • Numeric variable representing the day.

  4. hr:

    • Factor variable with 24 levels representing the hour.

    • Example: “0”, “1”, “2”, ..., “23”.

  5. holiday:

    • Numeric variable indicating whether it is a holiday (0 or 1).

  6. weekday:

    • Numeric variable representing the day of the week.

  7. workingday:

    • Numeric variable indicating whether it is a working day (0 or 1).

  8. weathersit:

    • Factor variable with 4 levels representing weather conditions.

    • Example: “clear”, “cloudy/misty”, ...

  9. temp:

    • Numeric variable representing the temperature.

  10. atemp:

    • Numeric variable representing the “feels-like” temperature.

  11. hum:

    • Numeric variable representing humidity.

  12. windspeed:

    • Numeric variable representing the wind speed.

  13. casual:

    • Numeric variable representing the number of casual bikers.

  14. registered:

    • Numeric variable representing the number of registered bikers.

  15. bikers:

    • Numeric variable representing the total number of bikers (sum of casual and registered bikers).

library(ISLR2)

mydata <- ISLR2::Bikeshare

To see the type of each variable, we use str():

str(mydata)
'data.frame':   8645 obs. of  15 variables:
 $ season    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ mnth      : Factor w/ 12 levels "Jan","Feb","March",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ day       : num  1 1 1 1 1 1 1 1 1 1 ...
 $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ holiday   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : num  6 6 6 6 6 6 6 6 6 6 ...
 $ workingday: num  0 0 0 0 0 0 0 0 0 0 ...
 $ weathersit: Factor w/ 4 levels "clear","cloudy/misty",..: 1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
 $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
 $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
 $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
 $ casual    : num  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: num  13 32 27 10 1 1 0 2 7 6 ...
 $ bikers    : num  16 40 32 13 1 1 2 3 8 14 ...
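str() prints a lot of detail. When only the class of each column is needed, sapply() with class() gives a more compact summary. The sketch below uses a small hypothetical stand-in data frame (toy) with a few Bikeshare-like columns so that it runs without ISLR2 installed; with the real data you would pass mydata instead.

```r
# Hypothetical stand-in with the same column types as a few Bikeshare columns
toy <- data.frame(
  season = c(1, 1, 2),
  mnth   = factor(c("Jan", "Jan", "April"), levels = c("Jan", "April")),
  temp   = c(0.24, 0.22, 0.50),
  bikers = c(16, 40, 32)
)

# Class of every column as a named character vector
col_classes <- sapply(toy, class)
col_classes

# Count how many columns fall into each class
table(col_classes)

# Names of the factor columns only
factor_cols <- names(toy)[sapply(toy, is.factor)]
factor_cols
```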

If you want to limit a numeric variable to a specific number of decimal places, use round(). Conversion functions such as as.numeric(), as.integer(), and as.Date() change a variable's type rather than its precision.
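The difference between rounding a value and changing its type can be seen in a short sketch. The vector atemp below is a hypothetical stand-in with values resembling the atemp column, not the actual data.

```r
# A few values resembling the atemp column (hypothetical stand-in)
atemp <- c(0.2879, 0.2727, 0.2576)

# round() keeps the value numeric but limits the decimal places
round(atemp, 2)            # 0.29 0.27 0.26

# as.integer() changes the type and drops the fractional part entirely
as.integer(atemp)          # 0 0 0

# sprintf() controls display only and returns character strings
sprintf("%.2f", atemp)     # "0.29" "0.27" "0.26"
```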

In this data set, mnth, day, and hr are not stored as dates or times. Depending on the type of analysis, this may or may not be a problem. Time series work, for example, needs these values in a proper date or date-time format. Converting a column always follows the same pattern: overwrite it with the result of a conversion function. The code below applies that pattern to three numeric columns, converting them to integers:

mydata$atemp <- as.integer(mydata$atemp)


mydata$temp <- as.integer(mydata$temp)


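mydata$bikers <- as.integer(mydata$bikers)

  1. The ‘atemp’ column is converted to integer using as.integer().
  2. The ‘temp’ column is converted to integer using as.integer().
  3. The ‘bikers’ column is converted to integer using as.integer().

Note that as.integer() drops the fractional part, so fractional values such as 0.24 in ‘temp’ and ‘atemp’ become 0, while the whole-number counts in ‘bikers’ convert without loss. The same assignment pattern works with as.Date() and as.POSIXct(), which is helpful when working with temporal data because it allows dates and times to be handled and analyzed appropriately in R.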
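For the date and time columns, one way the conversion could look is sketched below. It rests on assumptions not stated in this post: that day is the day of the year, that the observations start on 1 January 2011, and that the hr labels are hours of the day. A small hypothetical stand-in data frame (toy) is used so the code runs without ISLR2.

```r
# Hypothetical stand-in for three Bikeshare rows
toy <- data.frame(
  day = c(1, 1, 32),
  hr  = factor(c("0", "13", "23"), levels = as.character(0:23))
)

# Assumption: the observations start on 1 January 2011
year_start <- as.Date("2011-01-01")

# Day of year -> Date (day 1 is the start date itself)
toy$date <- year_start + (toy$day - 1)

# Factor -> hour number: go through the labels, not the level codes
toy$hour <- as.integer(as.character(toy$hr))
as.integer(toy$hr)   # level codes 1 14 24, not the hours

# Date plus hour -> POSIXct, using UTC to avoid daylight-saving shifts
toy$datetime <- as.POSIXct(format(toy$date), tz = "UTC") + toy$hour * 3600

str(toy)
```

Converting a factor with as.integer() directly returns its internal level codes, which is why the sketch goes through as.character() first.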

To see the new structure of the data after the conversion, run str() again:

str(mydata)
'data.frame':   8645 obs. of  15 variables:
 $ season    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ mnth      : Factor w/ 12 levels "Jan","Feb","March",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ day       : num  1 1 1 1 1 1 1 1 1 1 ...
 $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ holiday   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : num  6 6 6 6 6 6 6 6 6 6 ...
 $ workingday: num  0 0 0 0 0 0 0 0 0 0 ...
 $ weathersit: Factor w/ 4 levels "clear","cloudy/misty",..: 1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ atemp     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
 $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
 $ casual    : num  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: num  13 32 27 10 1 1 0 2 7 6 ...
 $ bikers    : int  16 40 32 13 1 1 2 3 8 14 ...
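The output shows temp and atemp reduced to 0, because as.integer() drops the fractional part. One way to guard against this is to convert only the columns whose values are already whole numbers. The sketch below assumes a hypothetical helper, is_whole(), and a small stand-in data frame (toy); neither is part of ISLR2 or base R.

```r
# Hypothetical stand-in for two Bikeshare columns
toy <- data.frame(temp = c(0.24, 0.22, 0.32), bikers = c(16, 40, 14))

# TRUE when every value is a whole number, so as.integer() loses nothing
is_whole <- function(v) all(v == floor(v))

is_whole(toy$temp)     # FALSE: converting would discard the fractions
is_whole(toy$bikers)   # TRUE: safe to convert

# Convert only the columns that pass the check
safe <- sapply(toy, is_whole)
toy[safe] <- lapply(toy[safe], as.integer)

sapply(toy, class)     # temp stays "numeric", bikers becomes "integer"
```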